首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Tracking systems are important in computervision, with applications in surveillance, human computer interaction, etc. Consumer graphics processing units (GPUs) have experienced an extraordinary evolution in both computing performance and programmability, leading to greater use of the GPU for non-rendering applications. In this work we propose a real-time object tracking algorithm, based on the hybridization of particle filtering (PF) and a multi-scale local search (MSLS) algorithm, presented for both CPU and GPU architectures. The developed system provides successful results in precise tracking of single and multiple targets in monocular video, operating in real-time at 70 frames per second for 640 × 480 video resolutions on the GPU, up to 1,100% faster than the CPU version of the algorithm.  相似文献   

2.
This paper presents a deep and extensive performance analysis of the particle filter (PF) algorithm for a very compute intensive 3D multi-view visual tracking problem. We compare different implementations and parameter settings of the PF algorithm in a CPU platform taking advantage of the multithreading capabilities of the modern processors and a graphics processing unit (GPU) platform using NVIDIA CUDA computing environment as developing framework. We extend our experimental study to each individual stage of the PF algorithm, and evaluate the quality versus performance trade-off among different ways to design these stages. We have observed that the GPU platform performs better than the multithreaded CPU platform when handling a large number of particles, but we also demonstrate that hybrid CPU/GPU implementations can run almost as fast as only GPU solutions.  相似文献   

3.
This paper proposed a multi-cue-based face-tracking algorithm with the supporting framework using parallel multi-core and one Graphic Processing Unit (GPU). Due to illumination and partial-occlusion problems, face tracking usually cannot stably work based on a single cue. Focusing on the above-mentioned problems, we first combined three different visual cues??color histogram, edge orientation histogram, and wavelet feature??under the framework of particle filters to considerably improve tracking performance. Furthermore, an online updating strategy made the algorithm adaptive to illumination changes and slight face rotations. Subsequently, attempting two parallel approaches resulted in real-time responses. However, the computational efficiency decreased considerably with the increase of particles and visual cues. In order to handle the large amount of computation costs resulting from the introduced multi-cue strategy, we explored two parallel computing techniques to speed up the tracking process, especially the most computation-intensive observational steps. One is a multi-core-based parallel algorithm with a MapReduce thread model, and the other is a GPU-based speedup approach. The GPU-based technique uses features-matching and particle weight computations, which have been put into the GPU kernel. The results demonstrate that the proposed face-tracking algorithm can work robustly with cluttered backgrounds and differing illuminations; the multi-core parallel scheme can increase the speed by 2?C6 times compared with that of the corresponding sequential algorithms. Furthermore, a GPU parallel scheme and co-processing scheme can achieve a greater increase in speed (8×?C12×) compared with the corresponding sequential algorithms.  相似文献   

4.
With fierce competition between CPU and graphics processing unit (GPU) platforms, performance evaluation has become the focus of various sectors. In this paper, we take a well‐known algorithm in the field of biosequence matching and database searching, the Smith–Waterman (S‐W) algorithm as an example, and demonstrate approaches that fully exploit its performance potentials on CPU, GPU, and field‐programmable gate array (FPGA) computing platforms. For CPU platforms, we perform two optimizations, single instruction, multiple data and multithread, with compiler options, to gain over 70 × speedups over naive CPU versions on quad‐core CPU platforms. For GPU platforms, we propose the combination of coalesced global memory accesses, shared memory tiles, and loop unfolding, achieving 50 × speedups over initial GPU versions on an NVIDIA GeForce GTX 470 card. Experimental results show that the GPU GTX 470 gains 12 × speedups, instead of 100 × reported by some studies, over Intel quadcore CPU Q9400, under the same manufacturing technology and both with fully optimized schemes. In addition, for FPGA platforms, we customize a linear systolic array for the S‐W algorithm in a 45‐nm FPGA chip from Xilinx (XC6VLX760), with up to 1024 processing elements. Under only 133 MHz clock rate, the FPGA platform reaches the highest performance and becomes the most power‐efficient platform, using only 25 W compared with 190 W of the GPU GTX 470. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

5.
In this paper, we present an algorithm for providing visually-guided unmanned aerial vehicle (UAV) control using visual information that is processed on a mobile graphic processing unit (GPU). Most real-time machine vision applications for UAVs exploit low-resolution images because the shortage of computational resources comes from size, weight and power issue. This leads to the limitation that the data are insufficient to provide the UAV with intelligent behavior. However, GPUs have emerged as inexpensive parallel processors that are capable of providing high computational power in mobile environments. We present an approach for detecting and tracking lines that use a mobile GPU. Hough transform and clustering techniques were used for robust and fast tracking. We achieved accurate line detection and faster tracking performance using the mobile GPU as compared with an x86 i5 CPU. Moreover, the average results showed that the GPU provided approximately five times speedup as compared to an ARM quad-core Cortex-A15. We conducted a detailed analysis of the performance of proposed tracking and detection algorithm and obtained meaningful results that could be utilized in real flight.  相似文献   

6.
The GPU (Graphics Processing Unit) has recently become one of the most power efficient processors in embedded and many other environments, and has been integrated into more and more SoCs (System on Chip). Thus modern GPUs play a very important role in power aware computing. Strongly Connected Component (SCC) decomposition is a fundamental graph algorithm which has wide applications in model checking, electronic design automation, social network analysis and other fields. GPUs have been shown to have great potential in accelerating many types of computations including graph algorithms. Recent work have demonstrated the plausibility of GPU SCC decomposition, but the implementation is inefficient due to insufficient consideration of the distinguishing GPU programming model, which leads to poor performance on irregular and sparse graphs.This paper presents a new GPU SCC decomposition algorithm that focuses on full utilization of the contemporary embedded and desktop GPU architecture. In particular, a subgraph numbering scheme is proposed to facilitate the safe and efficient management of the subgraph IDs and to serve as the basis of efficient source selection. Furthermore, we adopt a multi-source partition procedure that greatly reduces the recursion depth and use a vertex labeling approach that can highly optimize the GPU memory access. The evaluation results show that the proposed approach achieves up to 41× speedup over Tarjan’s algorithm, one of the most efficient sequential SCC decomposition algorithms, and up to 3.8× speedup over the previous GPU algorithms.  相似文献   

7.
Sparse Cholesky factorization is the most computationally intensive component in solving large sparse linear systems and is the core algorithm of numerous scientific computing applications. A large number of sparse Cholesky factorization algorithms have previously emerged, exploiting architectural features for various computing platforms. The recent use of graphics processing units (GPUs) to accelerate structured parallel applications shows the potential to achieve significant acceleration relative to desktop performance. However, sparse Cholesky factorization has not been explored sufficiently because of the complexity involved in its efficient implementation and the concerns of low GPU utilization. In this paper, we present a new approach for sparse Cholesky factorization on GPUs. We present the organization of the sparse matrix supernode data structure for GPU and propose a queue‐based approach for the generation and scheduling of GPU tasks with dense linear algebraic operations. We also design a subtree‐based parallel method for multi‐GPU system. These approaches increase GPU utilization, thus resulting in substantial computational time reduction. Comparisons are made with the existing parallel solvers by using problems arising from practical applications. The experiment results show that the proposed approaches can substantially improve sparse Cholesky factorization performance on GPUs. Relative to a highly optimized parallel algorithm on a 12‐core node, we were able to obtain speedups in the range 1.59× to 2.31× by using one GPU and 1.80× to 3.21× by using two GPUs. Relative to a state‐of‐the‐art solver based on supernodal method for CPU‐GPU heterogeneous platform, we were able to obtain speedups in the range 1.52× to 2.30× by using one GPU and 2.15× to 2.76× by using two GPUs. Concurrency and Computation: Practice and Experience, 2013. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

8.
In this paper, we propose a high-speed vision system that can be applied to real-time face tracking at 500 fps using GPU acceleration of a boosting-based face tracking algorithm. By assuming a small image displacement between frames, which is a property of high-frame rate vision, we develop an improved boosting-based face tracking algorithm for fast face tracking by enhancing the Viola–Jones face detector. In the improved algorithm, face detection can be efficiently accelerated by reducing the number of window searches for Haar-like features, and the tracked face pattern can be localized pixel-wise even when the window is sparsely scanned for a larger face pattern by introducing skin color extraction in the boosting-based face detector. The improved boosting-based face tracking algorithm is implemented on a GPU-based high-speed vision platform, and face tracking can be executed in real time at 500 fps for an 8-bit color image of 512 × 512 pixels. In order to verify the effectiveness of the developed face tracking system, we install it on a two-axis mechanical active vision system and perform several experiments for tracking face patterns.  相似文献   

9.
In this paper, we present an algorithm and provide design improvements needed to port the serial Lempel–Ziv–Storer–Szymanski (LZSS), lossless data compression algorithm, to a parallelized version suitable for general purpose graphic processor units (GPGPU), specifically for NVIDIA’s CUDA Framework. The two main stages of the algorithm, substring matching and encoding, are studied in detail to fit into the GPU architecture. We conducted detailed analysis of our performance results and compared them to serial and parallel CPU implementations of LZSS algorithm. We also benchmarked our algorithm in comparison with well known, widely used programs: GZIP and ZLIB. We achieved up to 34× better throughput than the serial CPU implementation of LZSS algorithm and up to 2.21× better than the parallelized version.  相似文献   

10.
Gabor wavelet transform is one of the most effective texture feature extraction techniques and has resulted in many successful practical applications. However, real-time applications cannot benefit from this technique because of the high computational cost arising from the large number of small-sized convolutions which require over 10 min to process an image of 256 × 256 pixels on a dual core CPU. As the computation in Gabor filtering is parallelizable, it is possible and beneficial to accelerate the feature extraction process using GPU. Conventionally, this can be achieved simply by accelerating the 2D convolution directly, or by expediting the CPU-efficient FFT-based 2D convolution. Indeed, the latter approach, when implemented with small-sized Gabor filters, cannot fully exploit the parallel computation power of GPU due to the architecture of graphics hardware. This paper proposes a novel approach tailored for GPU acceleration of the texture feature extraction algorithm by using separable 1D Gabor filters to approximate the non-separable Gabor filter kernels. Experimental results show that the approach improves the timing performance significantly with minimal error introduced. The method is specifically designed and optimized for computing unified device architecture and is able to achieve a speed of 16 fps on modest graphics hardware for an image of 2562 pixels and a filter kernel of 322 pixels. It is potentially applicable for real-time applications in areas such as motion tracking and medical image analysis.  相似文献   

11.
The ability to produce dynamic Depth of Field effects in live video streams was until recently a quality unique to movie cameras. In this paper, we present a computational camera solution coupled with real-time GPU processing to produce runtime dynamic Depth of Field effects. We first construct a hybrid-resolution stereo camera with a high-res/low-res camera pair. We recover a low-res disparity map of the scene using GPU-based Belief Propagation, and subsequently upsample it via fast Cross/Joint Bilateral Upsampling. With the recovered high-resolution disparity map, we warp the high-resolution video stream to nearby viewpoints to synthesize a light field toward the scene. We exploit parallel processing and atomic operations on the GPU to resolve visibility when multiple pixels warp to the same image location. Finally, we generate racking focus and tracking focus effects from the synthesized light field rendering. All processing stages are mapped onto NVIDIA’s CUDA architecture. Our system can produce racking and tracking focus effects for the resolution of 640×480 at 15 fps.  相似文献   

12.
A neural architecture for texture classification running on the Graphics Processing Unit (GPU) under a stream processing model is presented in this paper. Textural features extraction is done in three different scales, it is based on the computations that take place on the mammalian primary visual pathway and incorporates both structural and color information. Feature vectors classification is done using a fuzzy neural network which introduces pattern analysis for orientation invariant texture recognition. Performance tests are done over a varying number of textures and the entire VisTex database. The intrinsic parallelism of the neural system led us to implement the whole architecture to run on GPUs, providing a speed-up between × 16 and × 25 for classifying textures of sizes 128 × 128 and 512 × 512 px with respect to an implementation on the CPU. A comparison of classification rates obtained with other methods is included and shows the great performance of the architecture. An average classification rate of 85.2% is obtained for 167 textures of size 512 × 512 px.  相似文献   

13.
在临床超声实时成像系统中组织运动情况是医生想要获取的重要诊断信息, 例如心脏运动. 基于线积分卷积的二维矢量场可视化技术可以同时展现运动矢量场的强度和方向. 但这一算法在处理时涉及大量的复杂计算, 尤其是流线追踪处理部分, 使其成为临床实时成像系统中的一大性能提升瓶颈. 为此研究并提出了一种基于新兴的高性能并行计算平台Fermi架构GPU(graphics processing unit图形处理单元)的并行运动可视化算法. 数据测试结果显示, 与基于CPU的实现相比, 采用Fermi架构的GPU处理不仅可  相似文献   

14.
In some warning applications, such as aircraft taking-off and landing, ship sailing, and traffic guidance in foggy weather, the high definition (HD) and rapid dehazing of images and videos is increasingly necessary. Existing technologies for the dehazing of videos or images have not completely exploited the parallel computing capacity of modern multi-core CPU and GPU, and leads to the long dehazing time or the low frame rate of video dehazing which cannot meet the real-time requirement. In this paper, we propose a parallel implementation and optimization method for the real-time dehazing of the high definition videos based on a single image haze removal algorithm. Our optimization takes full advantage of the modern CPU+GPU architecture, which increases the parallelism of the algorithm, and greatly reduces the computational complexity and the execution time. The optimized OpenCL parallel implementation is integrate into FFmpeg as an independent module. The experimental results show that for a single image, the performance of the optimized OpenCL algorithm is improved approximately 500% compared with the existing algorithm, and approximately 153% over the basic OpenCL algorithm. The 1080p (1920?×?1080) high definition hazy video can also processed at a real-time rate (more than 41 frames per second).  相似文献   

15.
In wireless communication, Viterbi decoding algorithm (VDA) is the one of most popular channel decoding algorithms, which is widely used in WLAN, WiMAX, or 3G communications. However, the throughput of Viterbi decoder is constrained by the convolutional characteristic. Recently, the three‐point VDA (TVDA) was proposed to solve this problem. In TVDA, the whole procedure can be divided into three phases, the forward, trace‐back, and decoding phases. In this paper, we analyze the parallelism of TVDA and propose parallel TVDA on the multi‐core CPU, graphics processing unit (GPU), and field programmable gate array (FPGA). We demonstrate approaches that fully exploit its performance potential on CPU, GPU, and FPGA computing platforms. For CPU platforms, we perform two optimization methods, single instruction multiple data and multithreading to gain over 145 × speedup over the naive CPU version on a quad‐core CPU platform. For GPU platforms, we propose the combination of cached memory optimization, coalesced global memory accesses, codeword packing scheme, and asynchronous data transition, achieving the throughput of 404.65 Mbps and 12 × speedup over initial GPU versions on an NVIDIA GeForce GTX580 card and 7 × speedup over Intel quad‐core CPU i5‐2300, under the same manufacturing year and both with fully optimized schemes. In addition, for FPGA platforms, we customize a radix‐4 pipelined architecture for the TVDA in a 45‐nm FPGA chip from Xilinx (XC6VLX760). Under 209.15‐MHz clock rate, it achieves a throughput of 418.30 Mbps. Finally, we also discuss the performance evaluation and efficiency comparison of different flexible architectures for real‐time Viterbi decoding in terms of the decoding throughput, power consumption, optimization schemes, programming costs, and price costs.Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

16.
Multi-dimensional visual tracking (MVT) problems include visual tracking tasks where the system state is defined by a high number of variables corresponding to multiple model components and/or multiple targets. A MVT problem can be modeled as a dynamic optimization problem. In this context, we propose an algorithm which hybridizes particle filters (PF) and the scatter search (SS) metaheuristic, called scatter search particle filter (SSPF), where the optimization strategies from SS are embedded into the PF framework. Scatter search is a population-based metaheuristic successfully applied to several complex combinatorial optimization problems. The most representative optimization strategies from SS are both solution combination and solution improvement. Combination stage enables the solutions to share information about the problem to produce better solutions. Improvement stage makes also possible to obtain better solutions by exploring the neighborhood of a given solution. In this paper, we have described and evaluated the performance of the scatter search particle filter (SSPF) in MVT problems. Specifically, we have compared the performance of several state-of-the-art PF-based algorithms with SSPF algorithm in different instances of 2D articulated object tracking problem and 2D multiple object tracking. Some of these instances are from the CVBase’06 standard database. Experimental results show an important performance gain and better tracking accuracy in favour of our approach.  相似文献   

17.
朱遵尚  刘肖琳 《计算机工程》2010,36(12):213-215
针对Harris角点检测精度和检测速度问题,利用现代图形处理器(GPU)对角点检测算法进行改进,提出一种基于GPU的快速亚像素Harris角点检测算法,该算法利用了GPU的并行处理能力和亚像素Harris角点检测算法的并行性特点。实验结果表明,对于分辨率为720×720的24 bit视频图像,该算法能够实现实时的亚像素级Harris角点检测。  相似文献   

18.
Membrane systems are parallel distributed computing models that are used in a wide variety of areas. Use of a sequential machine to simulate membrane systems loses the advantage of parallelism in Membrane Computing. In this paper, an innovative classification algorithm based on a weighted network is introduced. Two new algorithms have been proposed for simulating membrane systems models on a Graphics Processing Unit (GPU). Communication and synchronization between threads and thread blocks in a GPU are time-consuming processes. In previous studies, dependent objects were assigned to different threads. This increases the need for communication between threads, and as a result, performance decreases. In previous studies, dependent membranes have also been assigned to different thread blocks, requiring inter-block communications and decreasing performance. The speedup of the proposed algorithm on a GPU that classifies dependent objects using a sequential approach, for example with 512 objects per membrane, was 82×, while for the previous approach (Algorithm 1), it was 8.2×. For a membrane system with high dependency among membranes, the speedup of the second proposed algorithm (Algorithm 3) was 12×, while for the previous approach (Algorithm 1) and the first proposed algorithm (Algorithm 2) that assign each membrane to one thread block, it was 1.8×.  相似文献   

19.
Two crucial aspects of general-purpose embedded visual point tracking are addressed in this paper. First, the algorithm should reliably track as many points as possible. Second, the computation should achieve real-time video processing, which is challenging on low power embedded platforms. We propose a new multi-scale semi-dense point tracker called Video Extruder, whose purpose is to fill the gap between short-term, dense motion estimation (optical flow) and long-term, sparse salient point tracking. This paper presents a new detector, including a new salience function with low computational complexity and a new selection strategy that allows to obtain a large number of keypoints. Its density and reliability in mobile video scenarios are compared with those of the FAST detector. Then, a multi-scale matching strategy is presented, based on hybrid regional coarse-to-fine and temporal prediction, which provides robustness to large camera and object accelerations. Filtering and merging strategies are then used to eliminate most of the wrong or useless trajectories. Thanks to its high degree of parallelism, the proposed algorithm extracts beams of trajectories from the video very efficiently. We compare it with the state-of-the-art pyramidal Lucas–Kanade point tracker and show that, in short range mobile video scenarios, it yields similar quality results, while being up to one order of magnitude faster. Three different parallel implementations of this tracker are presented, on multi-core CPU, GPU and ARM SoCs. On a commodity 2010 CPU, it can track 8,500 points in a 640 × 480 video at 150 Hz.  相似文献   

20.
Recent graphics processing units (GPUs), which have many processing units, can be used for general purpose parallel computation. To utilise the powerful computing ability, GPUs are widely used for general purpose processing. Since GPUs have very high memory bandwidth, the performance of GPUs greatly depends on memory access. The main contribution of this paper is to present a GPU implementation of computing Euclidean distance map (EDM) with efficient memory access. Given a two-dimensional (2D) binary image, EDM is a 2D array of the same size such that each element stores the Euclidean distance to the nearest black pixel. In the proposed GPU implementation, we have considered many programming issues of the GPU system such as coalesced access of global memory and shared memory bank conflicts, and so on. To be concrete, by transposing 2D arrays, which are temporal data stored in the global memory, with the shared memory, the main access from/to the global memory enables to be performed by coalesced access. In practice, we have implemented our parallel algorithm in the following three modern GPU systems: Tesla C1060, GTX 480 and GTX 580. The experimental results have shown that, for an input binary image with size of 9216 × 9216, our implementation can achieve a speedup factor of 54 over the sequential algorithm implementation.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号