1.

Exploiting task and data parallelism for advanced video coding on hybrid CPU + GPU platforms

Svetislav Momcilovic Nuno Roma Leonel Sousa 《Journal of Real-Time Image Processing》2016,11(3):571-587

2.

Video Extruder: a semi-dense point tracker for extracting beams of trajectories in real time

Matthieu Garrigues Antoine Manzanera Thierry M. Bernard 《Journal of Real-Time Image Processing》2016,11(4):785-798

Two crucial aspects of general-purpose embedded visual point tracking are addressed in this paper. First, the algorithm should reliably track as many points as possible. Second, the computation should achieve real-time video processing, which is challenging on low-power embedded platforms. We propose a new multi-scale semi-dense point tracker called Video Extruder, whose purpose is to fill the gap between short-term, dense motion estimation (optical flow) and long-term, sparse salient point tracking. This paper presents a new detector, including a new salience function with low computational complexity and a new selection strategy that makes it possible to obtain a large number of keypoints. Its density and reliability in mobile video scenarios are compared with those of the FAST detector. Then, a multi-scale matching strategy is presented, based on hybrid regional coarse-to-fine and temporal prediction, which provides robustness to large camera and object accelerations. Filtering and merging strategies are then used to eliminate most of the wrong or useless trajectories. Thanks to its high degree of parallelism, the proposed algorithm extracts beams of trajectories from the video very efficiently. We compare it with the state-of-the-art pyramidal Lucas–Kanade point tracker and show that, in short-range mobile video scenarios, it yields results of similar quality while being up to one order of magnitude faster. Three different parallel implementations of this tracker are presented, on multi-core CPU, GPU and ARM SoCs. On a commodity 2010 CPU, it can track 8,500 points in a 640 × 480 video at 150 Hz.

3.

Feature tracking and matching in video using programmable graphics hardware (total citations: 2; self-citations: 0; citations by others: 2)

Sudipta N. Sinha Jan-Michael Frahm Marc Pollefeys Yakup Genc 《Machine Vision and Applications》2011,22(1):207-217

This paper describes novel implementations of the KLT feature tracking and SIFT feature extraction algorithms that run on the graphics processing unit (GPU) and are suitable for video analysis in real-time vision systems. While significant acceleration over standard CPU implementations is obtained by exploiting parallelism provided by modern programmable graphics hardware, the CPU is freed up to run other computations in parallel. Our GPU-based KLT implementation tracks about a thousand features in real time at 30 Hz on 1,024 × 768 resolution video, which is a 20-fold improvement over the CPU. The GPU-based SIFT implementation extracts about 800 features from 640 × 480 video at 10 Hz, which is approximately 10 times faster than an optimized CPU implementation.

4.

Fast Gabor texture feature extraction with separable filters using GPU

Wai-Man Pang Kup-Sze Choi Jing Qin 《Journal of Real-Time Image Processing》2016,11(1):5-25

The Gabor wavelet transform is one of the most effective texture feature extraction techniques and has resulted in many successful practical applications. However, real-time applications cannot benefit from this technique because of the high computational cost arising from the large number of small-sized convolutions, which require over 10 min to process an image of 256 × 256 pixels on a dual-core CPU. As the computation in Gabor filtering is parallelizable, it is possible and beneficial to accelerate the feature extraction process using a GPU. Conventionally, this can be achieved simply by accelerating the 2D convolution directly, or by expediting the CPU-efficient FFT-based 2D convolution. However, the latter approach, when implemented with small-sized Gabor filters, cannot fully exploit the parallel computation power of the GPU due to the architecture of graphics hardware. This paper proposes a novel approach tailored for GPU acceleration of the texture feature extraction algorithm by using separable 1D Gabor filters to approximate the non-separable Gabor filter kernels. Experimental results show that the approach improves the timing performance significantly with minimal error introduced. The method is specifically designed and optimized for the compute unified device architecture (CUDA) and is able to achieve a speed of 16 fps on modest graphics hardware for an image of 256² pixels and a filter kernel of 32² pixels. It is potentially applicable to real-time applications in areas such as motion tracking and medical image analysis.
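
The separable-filter idea in the abstract above can be illustrated with a minimal sketch. When a 2D kernel is the outer product of a column vector and a row vector, filtering with it gives the same result as a 1D pass along the rows followed by a 1D pass along the columns, which cuts the per-pixel work from k² to 2k multiplications. The paper approximates non-separable Gabor kernels; the kernel below is a hypothetical, exactly separable one chosen only for illustration, and all function names are assumptions rather than the authors' code.

```python
def filter2d_valid(image, kernel):
    """Direct 2D filtering (correlation, no kernel flip) over the 'valid' region."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - kh + 1):
        out_row = []
        for x in range(w - kw + 1):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    acc += image[y + j][x + i] * kernel[j][i]
            out_row.append(acc)
        out.append(out_row)
    return out


def filter_separable_valid(image, col, row):
    """Same output using a horizontal 1D pass followed by a vertical 1D pass."""
    h, w = len(image), len(image[0])
    kw, kh = len(row), len(col)
    # Horizontal pass: filter every image row with `row`.
    tmp = [[sum(image[y][x + i] * row[i] for i in range(kw))
            for x in range(w - kw + 1)] for y in range(h)]
    # Vertical pass: filter every column of the intermediate result with `col`.
    return [[sum(tmp[y + j][x] * col[j] for j in range(kh))
             for x in range(w - kw + 1)] for y in range(h - kh + 1)]


# Hypothetical test data: a kernel that is separable by construction.
col = [1.0, 2.0, 1.0]
row = [1.0, 0.0, -1.0]
kernel = [[c * r for r in row] for c in col]  # outer product of col and row
image = [[float((3 * y + 5 * x) % 7) for x in range(6)] for y in range(5)]

full = filter2d_valid(image, kernel)
sep = filter_separable_valid(image, col, row)
max_diff = max(abs(a - b) for ra, rb in zip(full, sep) for a, b in zip(ra, rb))
```

For a kernel that is only approximately separable, as in the paper, the two outputs would differ by the approximation error rather than agree exactly.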

5.

Parallel implementation and optimization of high definition video real-time dehazing

Huailiang Tan Xiaofei He Zijian Wang Gaoming Liu 《Multimedia Tools and Applications》2017,76(22):23413-23434

In some warning applications, such as aircraft take-off and landing, ship sailing, and traffic guidance in foggy weather, rapid high-definition (HD) dehazing of images and videos is increasingly necessary. Existing technologies for dehazing videos or images have not fully exploited the parallel computing capacity of modern multi-core CPUs and GPUs, which leads to long dehazing times or low frame rates that cannot meet the real-time requirement. In this paper, we propose a parallel implementation and optimization method for the real-time dehazing of high-definition videos based on a single-image haze removal algorithm. Our optimization takes full advantage of the modern CPU+GPU architecture, which increases the parallelism of the algorithm and greatly reduces the computational complexity and the execution time. The optimized OpenCL parallel implementation is integrated into FFmpeg as an independent module. The experimental results show that, for a single image, the performance of the optimized OpenCL algorithm is improved by approximately 500% compared with the existing algorithm, and by approximately 153% over the basic OpenCL algorithm. A 1080p (1920 × 1080) high-definition hazy video can also be processed at a real-time rate (more than 41 frames per second).

6.

Parallel algorithm for tissue motion visualization in ultrasound imaging on the Fermi architecture

何兴无《计算机系统应用》2013,22(4):147-152

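In clinical real-time ultrasound imaging systems, tissue motion, such as cardiac motion, is important diagnostic information that physicians want to obtain. Two-dimensional vector field visualization based on line integral convolution can show both the magnitude and the direction of a motion vector field. However, the algorithm involves a large amount of complex computation, particularly in the streamline tracing stage, which makes it a major performance bottleneck in clinical real-time imaging systems. To address this, a parallel motion visualization algorithm is studied and proposed for GPUs (graphics processing units) of the Fermi architecture, an emerging high-performance parallel computing platform. Test results show that, compared with a CPU-based implementation, processing on a Fermi-architecture GPU can not only…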

7.

Image processing acceleration for intelligent unmanned aerial vehicle on mobile GPU

Dongwoon Jeon Doo-Hyun Kim Young-Guk Ha Vladimir Tyan 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2016,20(5):1713-1720

In this paper, we present an algorithm for visually guided unmanned aerial vehicle (UAV) control using visual information that is processed on a mobile graphics processing unit (GPU). Most real-time machine vision applications for UAVs use low-resolution images because size, weight and power constraints limit the available computational resources. As a result, the data are insufficient to provide the UAV with intelligent behavior. However, GPUs have emerged as inexpensive parallel processors that are capable of providing high computational power in mobile environments. We present an approach for detecting and tracking lines using a mobile GPU. Hough transform and clustering techniques were used for robust and fast tracking. We achieved accurate line detection and faster tracking performance using the mobile GPU as compared with an x86 i5 CPU. Moreover, the average results showed that the GPU provided an approximately fivefold speedup compared with an ARM quad-core Cortex-A15. We conducted a detailed analysis of the performance of the proposed tracking and detection algorithm and obtained meaningful results that could be utilized in real flight.

8.

Deployment of parallel linear genetic programming using GPUs on PC and video game console platforms

Garnett Wilson Wolfgang Banzhaf 《Genetic Programming and Evolvable Machines》2010,11(2):147-184

9.

Massively parallel Lucas Kanade optical flow for real-time video processing applications

Aurélien Plyer Guy Le Besnerais Frédéric Champagnat 《Journal of Real-Time Image Processing》2016,11(4):713-730

This paper deals with dense optical flow estimation from the perspective of the trade-off, required by real-world applications, between the quality of the estimated flow and the computational cost. We propose a fast and robust local method, denoted by eFOLKI, and describe its implementation on GPU. It leads to very high performance even on large image formats such as 4K (3,840 × 2,160) resolution. In order to assess the value of eFOLKI, we first present a comparative study with currently available GPU codes, including local and global methods, on a large set of data with ground truth. eFOLKI appears significantly faster while providing quite accurate and highly robust estimated flows. We then show, on four real-time video processing applications based on optical flow, that eFOLKI meets the requirements in terms of both estimated flow quality and processing rate.

10.

George Teodoro Eduardo Valle Nathan Mariano Ricardo Torres Wagner Meira Jr Joel H. Saltz 《The VLDB Journal The International Journal on Very Large Data Bases》2014,23(3):427-448

Similarity search in high-dimensional spaces is a pivotal operation for several database applications, including online content-based multimedia services. With the increasing popularity of multimedia applications, these services are facing new challenges regarding (1) the very large and growing volumes of data to be indexed/searched and (2) the necessity of reducing the response times as observed by end-users. In addition, the nature of the interactions between users and online services creates fluctuating query request rates throughout execution, which requires a similarity search engine to adapt to better use the computation platform and minimize response times. In this work, we address these challenges with Hypercurves, a flexible framework for answering approximate k-nearest neighbor (kNN) queries for very large multimedia databases. Hypercurves executes in hybrid CPU–GPU environments and is able to attain massive query-processing rates through the cooperative use of these devices. Hypercurves also changes its CPU–GPU task partitioning dynamically according to the observed load, aiming for optimal response times. In our empirical evaluation, dynamic task partitioning reduced query response times by approximately 50 % compared to the best static task partition. Due to a probabilistic proof of equivalence to the sequential kNN algorithm, the CPU–GPU execution of Hypercurves in distributed (multi-node) environments can be aggressively optimized, attaining superlinear scalability while still guaranteeing, with high probability, results at least as good as those from the sequential algorithm.

11.

FPGA–DSP co-processing for feature tracking in smart video sensors

Matteo Tomasi Shrinivas Pundlik Gang Luo 《Journal of Real-Time Image Processing》2016,11(4):751-767

Motion estimation in videos is a computationally intensive process. A popular strategy for dealing with such a high processing load is to accelerate algorithms with dedicated hardware such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and digital signal processors (DSPs). Previous approaches addressed the problem using accelerators together with a general-purpose processor, such as an ARM processor. In this work, we present a co-processing architecture using an FPGA and a DSP. A portable platform for motion estimation based on sparse feature point detection and tracking is developed for real-time embedded systems and smart video sensor applications. A Harris corner detection IP core is designed with a customized fine-grain pipeline on a Virtex-4 FPGA. The detected feature points are then tracked using the Lucas–Kanade algorithm in a DSP that acts as a co-processor for the FPGA. The hybrid system offers a throughput of 160 frames per second (fps) at VGA image resolution. We have also tested the benefits of our proposed solution (FPGA + DSP) in comparison with two other traditional architectures and co-processing strategies: hybrid ARM + DSP and DSP only. The proposed FPGA + DSP system offers a speedup of about 20 times and 3 times over the ARM + DSP and DSP-only configurations, respectively. A comparison of the Harris feature detection algorithm performance between different embedded processors (DSP, ARM, and FPGA) reveals that the DSP offers the best performance when scaling up from QVGA to VGA resolutions.

12.

500-fps face tracking system

Idaku Ishii Tomoki Ichida Qingyi Gu Takeshi Takaki 《Journal of Real-Time Image Processing》2013,8(4):379-388

In this paper, we propose a high-speed vision system that can be applied to real-time face tracking at 500 fps using GPU acceleration of a boosting-based face tracking algorithm. By assuming a small image displacement between frames, which is a property of high-frame rate vision, we develop an improved boosting-based face tracking algorithm for fast face tracking by enhancing the Viola–Jones face detector. In the improved algorithm, face detection can be efficiently accelerated by reducing the number of window searches for Haar-like features, and the tracked face pattern can be localized pixel-wise even when the window is sparsely scanned for a larger face pattern by introducing skin color extraction in the boosting-based face detector. The improved boosting-based face tracking algorithm is implemented on a GPU-based high-speed vision platform, and face tracking can be executed in real time at 500 fps for an 8-bit color image of 512 × 512 pixels. In order to verify the effectiveness of the developed face tracking system, we install it on a two-axis mechanical active vision system and perform several experiments for tracking face patterns.

13.

Research on a multi-projection blending system based on GPU real-time video processing

李晓光刘宏哲袁家政《计算机科学》2015,42(9):285-288

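This paper introduces high-speed parallel computing on GPUs and its important role in digital image and video processing. For multi-channel curved-screen projection systems, a real-time video processing scheme is proposed that uses a heterogeneous computing structure combining CPU and GPU. The scheme relies on the DirectShow link (filter graph) model to keep video processing flexible, and it designs and adopts geometric correction and edge blending algorithms suited to parallel computation to make video processing more efficient. The architecture can be used for high-quality display of single-channel 4K video, while effectively reducing construction cost and improving the system's cost-effectiveness.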

14.

Optimization schemes and performance evaluation of Smith–Waterman algorithm on CPU, GPU and FPGA

Dan Zou Yong Dou Fei Xia 《Concurrency and Computation》2012,24(14):1625-1644

With fierce competition between CPU and graphics processing unit (GPU) platforms, performance evaluation has become the focus of various sectors. In this paper, we take a well‐known algorithm in the field of biosequence matching and database searching, the Smith–Waterman (S‐W) algorithm, as an example, and demonstrate approaches that fully exploit its performance potential on CPU, GPU, and field‐programmable gate array (FPGA) computing platforms. For CPU platforms, we perform two optimizations, single instruction, multiple data and multithreading, with compiler options, to gain over 70 × speedups over naive CPU versions on quad‐core CPU platforms. For GPU platforms, we propose the combination of coalesced global memory accesses, shared memory tiles, and loop unfolding, achieving 50 × speedups over initial GPU versions on an NVIDIA GeForce GTX 470 card. Experimental results show that the GPU GTX 470 gains 12 × speedups, instead of the 100 × reported by some studies, over the Intel quad‐core CPU Q9400, under the same manufacturing technology and both with fully optimized schemes. In addition, for FPGA platforms, we customize a linear systolic array for the S‐W algorithm in a 45‐nm FPGA chip from Xilinx (XC6VLX760), with up to 1024 processing elements. At a clock rate of only 133 MHz, the FPGA platform reaches the highest performance and becomes the most power‐efficient platform, using only 25 W compared with 190 W for the GPU GTX 470. Copyright © 2011 John Wiley & Sons, Ltd.
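
For readers unfamiliar with the algorithm being optimized in the abstract above, the following is a minimal sequential sketch of the Smith–Waterman local alignment score. It assumes a linear gap penalty and hypothetical scoring values (`match=2`, `mismatch=-1`, `gap=-1`); it is a plain reference version and does not reproduce any of the SIMD, GPU, or FPGA optimizations the paper evaluates.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (linear gap penalty).

    The scoring parameters are illustrative assumptions, not values from the paper.
    Only two rows of the dynamic-programming table are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] aligned against a gap
                curr[j - 1] + gap,   # b[j-1] aligned against a gap
            )
            best = max(best, curr[j])
        prev = curr
    return best


score_same = smith_waterman_score("ACGT", "ACGT")
score_none = smith_waterman_score("AAAA", "TTTT")
score_core = smith_waterman_score("GGACGTGG", "TTACGTTT")
```

Each cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another; parallel implementations commonly exploit this property.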

15.

Miss-aware LLC buffer management strategy based on heterogeneous multi-core

Fang Juan Zhang Xibei Liu Shijian Chang Zeqing 《The Journal of supercomputing》2019,75(8):4519-4528

When multiple processor (CPU) cores and a GPU integrated on the same chip share the last-level cache (LLC), competition for the LLC becomes more serious. CPUs and GPUs have different memory access characteristics, so they differ in their sensitivity to LLC capacity. For many CPU applications, a reduced share of the LLC could lead to significant performance degradation. In contrast, GPU applications have a high number of concurrent threads and can tolerate access latency. Taking into account the memory latency tolerance of GPU programs, we propose an LLC buffer management strategy (buffer-for-GPU, BFG) for heterogeneous multi-core processors. A buffer is added beside the LLC to filter streaming requests from the GPU. Cache-insensitive GPU messages access the buffer directly instead of the LLC, thereby filtering the GPU requests and freeing up LLC space for the CPU application. Then, to account for the different characteristics of CPU and GPU applications, an improved LRU replacement policy that considers both the recent access time and the access frequency of each cache block is adopted. The cache miss-aware algorithm dynamically selects either the improved LRU or the standard LRU algorithm to fit the current operating state by comparing the cache miss rate in the buffer, so that the performance of the system is improved significantly.
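
The policy-selection idea described above can be sketched as follows. The abstract does not specify the improved LRU policy, so the frequency-aware variant here is a hypothetical stand-in (evict the block with the fewest hits, breaking ties by least recent use), and the selection is done offline on a whole trace rather than dynamically in hardware. All names and the access trace are assumptions for illustration.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses for plain LRU replacement."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)        # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[block] = True
    return misses


def freq_aware_misses(trace, capacity):
    """Count misses for a hypothetical frequency-and-recency-aware policy."""
    cache = {}  # block -> (hit_count, last_use_time)
    misses = 0
    for t, block in enumerate(trace):
        if block in cache:
            cache[block] = (cache[block][0] + 1, t)
        else:
            misses += 1
            if len(cache) >= capacity:
                # Fewest hits first; among equals, the least recently used.
                victim = min(cache, key=lambda b: cache[b])
                del cache[victim]
            cache[block] = (0, t)
    return misses


def pick_policy(trace, capacity):
    """Choose the policy with fewer misses on this trace (ties go to LRU)."""
    lru = lru_misses(trace, capacity)
    freq = freq_aware_misses(trace, capacity)
    return ("lru", lru) if lru <= freq else ("freq_aware", freq)


# Hypothetical trace: blocks 0 and 1 are reused; blocks 10..15 stream through once.
trace = [0, 1, 0, 1, 10, 11, 12, 0, 1, 13, 14, 15, 0, 1]
choice = pick_policy(trace, capacity=3)
```

On this trace the streaming blocks push the reused blocks out of a plain LRU cache, whereas the frequency-aware variant keeps them resident, which is the kind of difference a miss-rate comparison is meant to detect.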

16.

Performance of dynamic texture segmentation using GPU

Francisco Gómez Fernández María Elena Buemi Juan Manuel Rodríguez Julio C. Jacobo-Berlles 《Journal of Real-Time Image Processing》2016,11(2):375-383

This work is focused on the assessment of the use of GPU computation in dynamic texture segmentation under the mixture of dynamic textures (MDT) model. In this generative video model, the observed texture is a time-varying process driven by a hidden state process. The use of mixtures in this model allows different visual processes to be handled simultaneously. Nowadays, the use of GPU computing is growing in high-performance applications, but adapting existing algorithms so that they benefit from it is not an easy task. In this paper, we made two implementations, one on CPU and the other on GPU, of a known segmentation algorithm based on MDT. In the MDT algorithm, there is a matrix inversion process that is highly demanding in terms of computing power. We compare the performance gain obtained by porting only this matrix inversion process to the GPU with the gain obtained by porting the whole MDT segmentation process to the GPU. We also study real-time motion segmentation performance by separating the learning part of the algorithm from the segmentation part, leaving the learning stage as an off-line process and keeping the segmentation as an online process. The results of the performance analyses allow us to decide the cases in which the full GPU implementation of the motion segmentation process is worthwhile.

17.

Design and implementation of a fine-grained scheduling strategy for heterogeneous resources in Kubernetes

刘志彬黄秋兰胡庆宝程耀东胡誉田浩来《计算机工程》2023,49(2):31-36+45

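Efficient use of computing resources in heterogeneous environments is key to improving task efficiency and cluster utilization. Kubernetes is the preferred solution for container orchestration, but in heterogeneous resource scheduling its scheduler lacks fine-grained GPU information and therefore cannot satisfy user-defined requirements; moreover, when CPU and GPU nodes are deployed together, the scheduler is unaware of the heterogeneous resources, which leads to resource contention. Taking into account the distribution of heterogeneous resources across nodes and their hardware state, this paper proposes a Kubernetes-based fine-grained scheduling strategy for heterogeneous CPU/GPU resources. The device plugin mechanism is used to collect detailed information about the GPUs on each node, and the GPU resource metrics are submitted to the scheduling algorithm. On top of the original CPU and memory filtering algorithms, filtering on user-defined GPU information is added to select the nodes that meet users' fine-grained requirements. For mixed deployments of CPU and GPU nodes, the scheduler's scoring algorithm is improved to detect the application type dynamically: a load-balancing algorithm is applied to CPU applications and a minimum best-fit algorithm to GPU applications, so that the heterogeneous scheduling strategy schedules each type of application correctly and makes full use of fragmented resources on GPU nodes when CPU resources are insufficient. Experiments on fine-grained GPU scheduling and on scheduling under mixed CPU/GPU node deployment show that the strategy schedules GPUs effectively and avoids resource contention.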
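
The filter-then-score flow described in the abstract above can be sketched in a few lines. This is a toy model, not the Kubernetes scheduler API and not the authors' implementation: the `Node` and `Pod` classes, the scoring formulas, and the penalty that keeps CPU applications off GPU nodes unless no CPU node fits are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu_free: float                     # free CPU cores
    cpu_total: float                    # total CPU cores
    gpu_free_mem: list = field(default_factory=list)  # free memory (GiB) per GPU


@dataclass
class Pod:
    cpu: float                          # requested CPU cores
    gpu_mem: float = 0.0                # requested GPU memory; 0 means CPU-only


def filter_nodes(pod, nodes):
    """Keep nodes with enough CPU and, for GPU pods, one GPU with enough free memory."""
    feasible = []
    for node in nodes:
        if node.cpu_free < pod.cpu:
            continue
        if pod.gpu_mem > 0 and not any(m >= pod.gpu_mem for m in node.gpu_free_mem):
            continue
        feasible.append(node)
    return feasible


def score(pod, node):
    """Higher is better. Both formulas are illustrative assumptions."""
    if pod.gpu_mem > 0:
        # Best fit: prefer the node whose tightest fitting GPU leaves the least memory over.
        leftover = min(m - pod.gpu_mem for m in node.gpu_free_mem if m >= pod.gpu_mem)
        return -leftover
    # Load balancing: prefer the node left with the largest free CPU fraction.
    balance = (node.cpu_free - pod.cpu) / node.cpu_total
    # Assumed penalty: GPU nodes are only a fallback for CPU-only pods.
    return balance - (1.0 if node.gpu_free_mem else 0.0)


def schedule(pod, nodes):
    """Return the name of the chosen node, or None if no node is feasible."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None
    return max(feasible, key=lambda n: score(pod, n)).name


# Hypothetical cluster.
nodes = [
    Node("cpu-a", 2, 16),
    Node("cpu-b", 12, 16),
    Node("gpu-a", 8, 32, [40, 10]),
    Node("gpu-b", 8, 32, [16]),
]
cpu_choice = schedule(Pod(cpu=4), nodes)
gpu_choice = schedule(Pod(cpu=2, gpu_mem=8), nodes)
fallback_choice = schedule(Pod(cpu=6), [nodes[0], nodes[2]])
```

In this toy cluster the CPU-only pod lands on the least loaded CPU node, the GPU pod lands on the node whose GPU fits it most tightly, and a CPU-only pod falls back to a GPU node only when no CPU node has room.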

18.

CUDA optimization and implementation of a motion estimation search algorithm

陈佐陈汉季加良《计算机工程与应用》2010,46(32):171-176

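Motion estimation search is the most computation-intensive and time-consuming part of H.264 compression coding. Drawing on the parallel optimization approach of graphics processors, this paper studies how to parallelize the GEA (global elimination algorithm) motion estimation search algorithm on the CUDA computing platform, and discusses in detail key technical issues including parallel design, data processing, and result feedback. Finally, the running efficiency of the algorithm is compared and analyzed using experimental data. The experimental results show that the GEA search algorithm on the GPU achieves significantly better motion search performance than on the CPU.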
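
As background for the motion estimation search discussed above, the following is a minimal sketch of plain full-search block matching using the sum of absolute differences (SAD). It is not the GEA algorithm studied in the paper, which eliminates candidates before the full SAD is computed; the function names, frame contents, and search radius are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    return sum(abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
               for j in range(n) for i in range(n))


def full_search(cur, ref, bx, by, n, radius):
    """Test every displacement within +/- radius; return (best_sad, dx, dy)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + n > w or y + n > h:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Hypothetical frames: `cur` shows the content of `ref` shifted by (2, 1).
size = 12
ref = [[y * 16 + x for x in range(size)] for y in range(size)]
cur = [[ref[min(y + 1, size - 1)][min(x + 2, size - 1)] for x in range(size)]
       for y in range(size)]

result = full_search(cur, ref, bx=4, by=4, n=4, radius=3)
```

The SAD of each candidate displacement is independent of the others, which is why this search maps naturally onto many GPU threads.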

19.

GPU-based real-time sub-pixel Harris corner detection

朱遵尚刘肖琳《计算机工程》2010,36(12):213-215

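To address the accuracy and speed of Harris corner detection, the corner detection algorithm is improved using a modern graphics processing unit (GPU), and a fast GPU-based sub-pixel Harris corner detection algorithm is proposed. The algorithm exploits the parallel processing capability of the GPU and the inherent parallelism of sub-pixel Harris corner detection. Experimental results show that, for 24-bit video images at a resolution of 720×720, the algorithm achieves real-time sub-pixel Harris corner detection.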

20.

GPU-based parallel decoding algorithm for H.264

陈鹏曹剑炜陈庆奎《计算机工程》2014,(1):283-286

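To address parallel decoding of H.264 video streams, a cooperative CPU/GPU computing algorithm is proposed. Using the Compute Unified Device Architecture (CUDA) as the GPU programming model, the inverse DCT and intra prediction are accelerated on the GPU. While maintaining high computational accuracy, CUDA mixed programming is used to improve the system's computing performance. Using the CUDA language provided by NVIDIA, the inverse DCT and intra prediction are executed in parallel on the GPU during decoding; the parallel algorithm is compared with a CPU-only implementation, and the speedup of the parallel decoding algorithm is verified with different numbers of video streams. Experimental results show that the algorithm substantially improves video stream encoding and decoding efficiency, improving the average computational speedup tenfold relative to the CPU-only implementation.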