期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Scalable hardware architecture for disparity map computation and object location in real-time

Pedro Miguel Santos João Canas Ferreira José Silva Matos 《Journal of Real-Time Image Processing》2016,11(3):473-485

We present the disparity map computation core of a hardware system for isolating foreground objects in stereoscopic video streams. The operation is based on the computation of dense disparity maps using block-matching algorithms and two well-known metrics: sum of absolute differences and Census transform. Two sets of disparity maps are computed by taking each of the images as reference so that a consistency check can be performed to identify occluded pixels and eliminate spurious foreground pixels. Taking advantage of parallelism, the proposed architecture is highly scalable and provides numerous degrees of adjustment to different application needs, performance levels and resource usage. A version of the system for 640 × 480 images and a maximum disparity of 135 pixels was implemented in a system based on a Xilinx Virtex II-Pro FPGA and two cameras with a frame rate of 25 fps (less than the maximum supported frame rate of 40 fps on this platform). Implementation of the same system on a Virtex-5 FPGA is estimated to achieve 80 fps, while a version with increased parallelism is estimated to run at 140 fps (which corresponds to the calculation of more than 5.9 × 10⁹ disparity-pixels per second). 相似文献

2.

Pipelined architecture for real-time cost-optimized extraction of visual primitives based on FPGAs

F. Barranco M. Tomasi J. Díaz M. Vanegas E. Ros 《Digital Signal Processing》2013,23(2):675-688

This paper presents an architecture for the extraction of visual primitives on chip: energy, orientation, disparity, and optical flow. This cost-optimized architecture processes in real time high-resolution images for real-life applications. In fact, we present a versatile architecture that may be customized for different performance requirements depending on the target application. In this case, dedicated hardware and its potential on-chip implementation on FPGA devices become an efficient solution. We have developed a multi-scale approach for the computation of the gradient-based primitives. Gradient-based methods are very popular in the literature because they provide a very competitive accuracy vs. efficiency trade-off. The hardware implementation of the system is performed using superscalar fine-grain pipelines to exploit the maximum degree of parallelism provided by the FPGA. The system reaches 350 and 270 VGA frames per second (fps) for the disparity and optical flow computations respectively in their mono-scale version and up to 32 fps for the multi-scale scheme extracting all the described features in parallel. In this work we also analyze the performance in accuracy and hardware resources of the proposed implementation. 相似文献

3.

FPGA–DSP co-processing for feature tracking in smart video sensors

Matteo Tomasi Shrinivas Pundlik Gang Luo 《Journal of Real-Time Image Processing》2016,11(4):751-767

Motion estimation in videos is a computationally intensive process. A popular strategy for dealing with such a high processing load is to accelerate algorithms with dedicated hardware such as graphic processor units (GPU), field programmable gate arrays (FPGA), and digital signal processors (DSP). Previous approaches addressed the problem using accelerators together with a general purpose processor, such as acorn RISC machines (ARM). In this work, we present a co-processing architecture using FPGA and DSP. A portable platform for motion estimation based on sparse feature point detection and tracking is developed for real-time embedded systems and smart video sensors applications. A Harris corner detection IP core is designed with a customized fine grain pipeline on a Virtex-4 FPGA. The detected feature points are then tracked using the Lucas–Kanade algorithm in a DSP that acts as a co-processor for the FPGA. The hybrid system offers a throughput of 160 frames per second (fps) for VGA image resolution. We have also tested the benefits of our proposed solution (FPGA + DSP) in comparison with two other traditional architectures and co-processing strategies: hybrid ARM + DSP and DSP only. The proposed FPGA + DSP system offers a speedup of about 20 times and 3 times over ARM + DSP and DSP only configurations, respectively. A comparison of the Harris feature detection algorithm performance between different embedded processors (DSP, ARM, and FPGA) reveals that the DSP offers the best performance when scaling up from QVGA to VGA resolutions. 相似文献

4.

Architecture and applications of the FingerMouse: a smart stereo camera for wearable computing HCI 总被引：1，自引：0，他引：1

Patrick de la Hamette Gerhard Tröster 《Personal and Ubiquitous Computing》2008,12(2):97-110

In this paper we present a visual input HCI system for wearable computers, the FingerMouse. It is a fully integrated stereo camera and vision processing system, with a specifically designed ASIC performing stereo block matching at 5 Mpixel/s (e.g. QVGA 320 × 240 at 30 fps) and a disparity range of 47, consuming 187 mW (78 mW in the ASIC). It is button-sized (43 mm × 18 mm) and can be worn on the body, capturing the user’s hand and processing in real-time its coordinates as well as a 1-bit image of the hand segmented from the background. Alternatively, the system serves as a smart depth camera, delivering foreground segmentation and tracking, depth maps and standard images, with a processing latency smaller than 1 ms. This paper describes the FingerMouse functionality and its applications, and how the specific architecture outperforms other systems in size, latency and power consumption. 相似文献

5.

Accelerating embedded image processing for real time: a case study

Sol Pedre Tomáš Krajník Elías Todorovich Patricia Borensztejn 《Journal of Real-Time Image Processing》2016,11(2):349-374

Many image processing applications need real-time performance, while having restrictions of size, weight and power consumption. Common solutions, including hardware/software co-designs, are based on Field Programmable Gate Arrays (FPGAs). Their main drawback is long development time. In this work, a co-design methodology for processor-centric embedded systems with hardware acceleration using FPGAs is proposed. The goal of this methodology is to achieve real-time embedded solutions, using hardware acceleration, but achieving development time similar to that of software projects. Well established methodologies, techniques and languages from the software domain—such as Object-Oriented Paradigm design, Unified Modelling Language, and multithreading programming—are applied; and semiautomatic C-to-HDL translation tools and methods are used and compared. The methodology is applied to achieve an embedded implementation of a global vision algorithm for the localization of multiple robots in an e-learning robotic laboratory. The algorithm is specifically developed to work reliably 24/7 and to detect the robot’s positions and headings even in the presence of partial occlusions and varying lighting conditions expectable in a normal classroom. The co-designed implementation of this algorithm processes 1,600 × 1,200 pixel images at a rate of 32 fps with an estimated energy consumption of 17 mJ per frame. It achieves a 16× acceleration and 92 % energy saving, which compares favorably with the most optimized embedded software solutions. This case study shows the usefulness of the proposed methodology for embedded real-time image processing applications. 相似文献

6.

A low energy adaptive motion estimation hardware for H.264 multiview video coding

Yusuf Aksehir Kamil Erdayandi Tevfik Zafer Ozcan Ilker Hamzaoglu 《Journal of Real-Time Image Processing》2018,15(1):3-12

Multiview video coding (MVC) is the process of efficiently compressing stereo (two views) or multiview video signals. The improved compression efficiency achieved by H.264 MVC comes with a significant increase in computational complexity. Temporal prediction and inter-view prediction are the most computationally intensive parts of H.264 MVC. Therefore, in this paper, we propose novel techniques for reducing the amount of computations performed by temporal and inter-view predictions in H.264 MVC. The proposed techniques reduce the amount of computations performed by temporal and inter-view predictions significantly with very small PSNR loss and bit rate increase. We also propose a low energy adaptive H.264 MVC motion estimation hardware for implementing the temporal and inter-view predictions including the proposed computation reduction techniques. The proposed hardware is implemented in Verilog HDL and mapped to a Xilinx Virtex-6 FPGA. The FPGA implementation is capable of processing 30 × 8 = 240 frames per second (fps) of CIF (352 × 288) size eight view video sequence or 30 × 2 = 60 fps of VGA (640 × 480) size stereo (two views) video sequence. The proposed techniques reduce the energy consumption of this hardware significantly. 相似文献

7.

Embedded planar surface segmentation system for stereo images

Ninad Thakoor Jean Gao Sungyong Jung 《Machine Vision and Applications》2010,21(2):189-199

An embedded system is developed to segment stereo images using disparity. The recent developments in the embedded system architecture have allowed real time implementation of low-level vision tasks such as stereo disparity computation. At the same time, an intermediate level task such as segmentation is rarely attempted in an embedded system. To solve the planar surface segmentation problem, which is iterative in nature, our system implements a Segmentation–Estimation framework. In the segmentation phase, segmentation labels are assigned based on the underlying plane parameters. Connected component analysis is carried out on the segmentation result to select the largest spatially connected area for each plane. From the largest areas, the parameters for each plane are reestimated. This iterative process was implemented on TMS320DM642 based embedded system that operates at 3–5 frames per second on images of size 320 × 240. 相似文献

8.

Scalable Unified Transform Architecture for Advanced Video Coding Embedded Systems

Tiago Dias Sebastián López Nuno Roma Leonel Sousa 《International journal of parallel programming》2013,41(2):236-260

A novel high throughput and scalable unified architecture for the computation of the transform operations in video codecs for advanced standards is presented in this paper. This structure can be used as a hardware accelerator in modern embedded systems to efficiently compute all the two-dimensional 4 × 4 and 2 × 2 transforms of the H.264/AVC standard. Moreover, its highly flexible design and hardware efficiency allows it to be easily scaled in terms of performance and hardware cost to meet the specific requirements of any given video coding application. Experimental results obtained using a Xilinx Virtex-5 FPGA demonstrated the superior performance and hardware efficiency levels provided by the proposed structure, which presents a throughput per unit of area relatively higher than other similar recently published designs targeting the H.264/AVC standard. Such results also showed that, when integrated in a multi-core embedded system, this architecture provides speedup factors of about 120× concerning pure software implementations of the transform algorithms, therefore allowing the computation, in real-time, of all the above mentioned transforms for Ultra High Definition Video (UHDV) sequences (4,320 × 7,680 @ 30 fps). 相似文献

9.

Fine grain pipeline architecture for high performance phase-based optical flow computation

M. Tomasi F. Barranco M. Vanegas J. Díaz E. Ros 《Journal of Systems Architecture》2010,56(11):577-587

Accurate motion analysis of real life sequences is a very active research field due to its multiple potential applications. Currently, new technologies offer us very fast and accurate sensors that provide a huge quantity of data per second. Processing these data streams is very expensive (in terms of computing power) for general purpose processors and therefore, is beyond processing capabilities of most current embedded devices. In this work, we present a specific hardware architecture that implements a robust optical flow algorithm able to process input video sequences at a high frame rate and high resolution, up to 160 fps for VGA images. We describe a superpipelined datapath of more than 85 stages (some of them configured with superscalar units able to process several data in parallel). Therefore, we have designed an intensive parallel processing engine. System speed (frames per second) produces fine optical flow estimations (by constraining the actual motion ranges between consecutive frames) and the phase-based method confers the system robustness to image noise or illumination changes. In this work, we analyze the architecture of different frame rates and input image noise levels. We compare the results with other approaches in the state of the art and validate our implementation using several hardware platforms. 相似文献

10.

Real-time background generation and foreground object segmentation for high-definition colour video stream in FPGA device

Tomasz Kryjak Mateusz Komorkiewicz Marek Gorgon 《Journal of Real-Time Image Processing》2014,9(1):61-77

The processing of a high-definition video stream in real-time is a challenging task for embedded systems. However, modern FPGA devices have both a high operating frequency and sufficient logic resources to be successfully used in these tasks. In this article, an advanced system that is able to generate and maintain a complex background model for a scene as well as segment the foreground for an HD colour video stream (1,920 × 1,080 @ 60 fps) in real-time is presented. The possible application ranges from video surveillance to machine vision systems. That is, in all cases, when information is needed about which objects are new or moving in the scene. Excellent results are obtained by using the CIE Lab colour space, advanced background representation as well as integrating information about lightness, colour and texture in the segmentation step. Finally, the complete system is implemented in a single high-end FPGA device. 相似文献

11.

500-fps face tracking system

Idaku Ishii Tomoki Ichida Qingyi Gu Takeshi Takaki 《Journal of Real-Time Image Processing》2013,8(4):379-388

In this paper, we propose a high-speed vision system that can be applied to real-time face tracking at 500 fps using GPU acceleration of a boosting-based face tracking algorithm. By assuming a small image displacement between frames, which is a property of high-frame rate vision, we develop an improved boosting-based face tracking algorithm for fast face tracking by enhancing the Viola–Jones face detector. In the improved algorithm, face detection can be efficiently accelerated by reducing the number of window searches for Haar-like features, and the tracked face pattern can be localized pixel-wise even when the window is sparsely scanned for a larger face pattern by introducing skin color extraction in the boosting-based face detector. The improved boosting-based face tracking algorithm is implemented on a GPU-based high-speed vision platform, and face tracking can be executed in real time at 500 fps for an 8-bit color image of 512 × 512 pixels. In order to verify the effectiveness of the developed face tracking system, we install it on a two-axis mechanical active vision system and perform several experiments for tracking face patterns. 相似文献

12.

Racking focus and tracking focus on live video streams: a stereo solution

Zhan Yu Xuan Yu Christopher Thorpe Scott Grauer-Gray Feng Li Jingyi Yu 《The Visual computer》2014,30(1):45-58

The ability to produce dynamic Depth of Field effects in live video streams was until recently a quality unique to movie cameras. In this paper, we present a computational camera solution coupled with real-time GPU processing to produce runtime dynamic Depth of Field effects. We first construct a hybrid-resolution stereo camera with a high-res/low-res camera pair. We recover a low-res disparity map of the scene using GPU-based Belief Propagation, and subsequently upsample it via fast Cross/Joint Bilateral Upsampling. With the recovered high-resolution disparity map, we warp the high-resolution video stream to nearby viewpoints to synthesize a light field toward the scene. We exploit parallel processing and atomic operations on the GPU to resolve visibility when multiple pixels warp to the same image location. Finally, we generate racking focus and tracking focus effects from the synthesized light field rendering. All processing stages are mapped onto NVIDIA’s CUDA architecture. Our system can produce racking and tracking focus effects for the resolution of 640×480 at 15 fps. 相似文献

13.

Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization

Hong Jun Choi Dong Oh Son Jong Myon Kim Cheol Hong Kim 《The Journal of supercomputing》2014,69(1):330-356

Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence due to branch instructions causes underutilization of computational resources, resulting in performance degradation of SIMD architecture. Graphics processing unit (GPU) is a representative parallel architecture based on SIMD architecture. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, contrary to graphics applications, general-purpose applications include many branch instructions, resulting in serious performance degradation of GPU due to branch divergence. In this paper, we propose concurrent warp execution (CWE) technique to reduce the performance degradation of GPU in executing general-purpose applications by increasing resource utilization. The proposed CWE enables selecting co-warps to activate more threads in the warp, leading to concurrent execution of combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85 % over PDOM, 91 % over DWF) with little hardware overhead. 相似文献

14.

A multi-streaming SIMD multimedia computing engine

Jih-Ching Chiu Yu-Liang Chou 《Microprocessors and Microsystems》2010,34(7-8):247-258

Current multimedia extensions provide a mechanism for general-purpose processors to meet the growing performance demand of multimedia applications. However, the computing performance of these extensions is often limited for the design conceptions of the single data stream. This paper presents an architecture called “multi-streaming SIMD” that enables current multimedia extensions to simultaneously manipulate multiple data streams. To efficiently and flexibly realize the proposed architecture, an operation cell is designed by fusing the logic gates and the storage cells together. Multiple operation cells then are connected to compose a register file with the ability of performing SIMD operations called “Multimedia Operation Storage Unit (MOSU)”. Further, many MOSUs are used to compose a multi-streaming SIMD computing engine that can simultaneously manipulate multiple data streams and exploit the subword parallelisms of the elements in each data stream. This paper also designs three instruction modes (global, coupling, and isolated modes) for programmers to dynamically configure the multi-streaming SIMD computing engine at the instruction level to manipulate different amounts of data streams. Simulation results show that when the multi-streaming SIMD architecture has four 4-register MOSUs, it provides a factor of 3.3×–5.5× performance enhancement for traditional MMX extensions on 12 multimedia kernels. 相似文献

15.

Leveraging cost matrix structure for hardware implementation of stereo disparity computation using dynamic programming

W. James MacLean Siraj Sabihuddin Jamin Islam 《Computer Vision and Image Understanding》2010,114(11):1126-1138

Dynamic programming is a powerful method for solving energy minimisation problems in computer vision, for example stereo disparity computations. While it may be desirable to implement this algorithm in hardware to achieve frame-rate processing, a na?¨ve implementation may fail to meet timing requirements. In this paper, the structure of the cost matrix is examined to provide improved methods of hardware implementation. It is noted that by computing cost matrix entries along anti-diagonals instead of rows, the cost matrix entries can be computed in a pipelined architecture. Further, if only a subset of the cost matrix needs to be considered, for example by placing limits on the disparity range (include neglecting negative disparities by assuming rectified images), the resources required to compute the cost matrix in parallel can be reduced. Boundary conditions required to allow computing a subset of the cost matrix are detailed. Finally, a hardware solution of Cox’s maximum-likelihood, dynamic programming stereo disparity algorithm is implemented to demonstrate the performance achieved. The design provides high frame rate (>123 fps) estimates for a large disparity range (e.g. 128 pixels), for image sizes of 640 × 480 pixels, and can be simply extended to work well over 200 fps. 相似文献

16.

A fast face detection architecture for auto-focus in smart-phones and digital cameras一个快速的人脸检测架构用于智能手机和数码相机的自动对焦

Peng Ouyang Shouyi Yin Chenchen Deng Leibo Liu Shaojun Wei 《中国科学:信息科学(英文版)》2016,59(12):122402

Auto-focus is very important for capturing sharp human face centered images in digital and smart phone cameras. With the development of image sensor technology, these cameras support more and more highresolution images to be processed. Currently it is difficult to support fast auto-focus at low power consumption on high-resolution images. This work proposes an efficient architecture for an AdaBoost-based face-priority auto-focus. The architecture supports block-based integral image computation to improve the processing speed on high-resolution images; meanwhile, it is reconfigurable so that it enables the sub-window adaptive cascade classification, which greatly improves the processing speed and reduces power consumption. Experimental results show that 96% detection rate in average and 58 fps (frame per second) detection speed are achieved for the 1080p (1920×1080) images. Compared with the state-of-the-art work, the detection speed is greatly improved and power consumption is largely reduced. 相似文献

17.

面向3DTV的高分辨率摄像机阵列设计与实现

张茂军李乐包卫东谭树人《计算机应用》2008,28(7):1872-1874

当前摄像机阵列采集的视频分辨率与帧率普遍不高,还不能满足制作高质量3DTV节目的需要。设计与实现了一种高分辨率摄像机阵列,可以依据应用需要,自由配置为27fps帧率1280×1024分辨率,或93fps帧率640×480分辨率两种模式。讨论了该摄像机阵列的总体结构设计、摄像机单元设计、同步控制机制与实时存储技术等。相似文献

18.

FPGA based disparity map computation with vergence control

Christos Georgoulas Ioannis Andreadis 《Microprocessors and Microsystems》2010,34(7-8):259-273

Depth estimation in a scene using image pairs acquired by a stereo camera setup, is one of the important tasks of stereo vision systems. The disparity between the stereo images allows for 3D information acquisition which is indispensable in many machine vision applications. Practical stereo vision systems involve wide ranges of disparity levels. Considering that disparity map extraction of an image is a computationally demanding task, practical real-time FPGA based algorithms require increased device utilization resource usage, depending on the disparity levels operational range, which leads to significant power consumption. In this paper a new hardware-efficient real-time disparity map computation module is developed. The module constantly estimates the precisely required range of disparity levels upon a given stereo image set, maintaining this range as low as possible by verging the stereo setup cameras axes. This enables a parallel-pipelined design, for the overall module, realized on a single FPGA device of the Altera Stratix IV family. Accurate disparity maps are computed at a rate of more than 320 frames per second, for a stereo image pair of 640 × 480 pixels spatial resolution with a disparity range of 80 pixels. The presented technique provides very good processing speed at the expense of accuracy, with very good scalability in terms of disparity levels. The proposed method enables a suitable module delivering high performance in real-time stereo vision applications, where space and power are significant concerns. 相似文献

19.

Architecture design of the high-throughput compensator and interpolator for the H.265/HEVC encoder

Grzegorz Pastuszak Maciej Trochimiuk 《Journal of Real-Time Image Processing》2016,11(4):663-673

This paper presents the architecture of the high-throughput compensator and the interpolator used in the motion estimation of the H.265/HEVC encoder. The architecture can process 8×8 blocks in each clock cycle. The design allows the random order of checked coding blocks and motion vectors. This feature makes the architecture suitable for different search algorithms. The interpolator embeds 64 multiplierless reconfigurable filter cores to support computations for different fractional-pel positions. Synthesis results show that the design can operate at 200 and 400 MHz when implemented in FPGA Arria II and TSMC 90 nm, respectively. The computational scalability enables the proposed architecture to trade the throughput for the compression efficiency. If 2160p@30fps video is encoded, the design clocked at 400 MHz can check about 100 motion vectors for 8×8 blocks. 相似文献

20.

A reduced memory bandwidth and high throughput HDTV motion compensation decoder for H.264/AVC High 4:2:2 profile

Bruno Zatt Leandro M. de L. Silva Arnaldo Azevedo Luciano Agostini Altamiro Susin Sergio Bampi 《Journal of Real-Time Image Processing》2013,8(1):127-140

This article presents the HP422-MoCHA: optimized Motion Compensation hardware architecture for the High 4:2:2 profile of H.264/AVC video coding standard. The proposed design focuses on real-time decoding for HDTV 1080p (1,920 × 1,080 pixels) at 30 fps. It supports multiple sample bit-width (8, 9, or 10 bits) and multiple chroma sub-sampling formats (4:0:0, 4:2:0, and 4:2:2) to provide enhanced video quality experience. The architecture includes an optimized sample interpolator that processes luma and chroma samples in two parallel datapaths and features quarter sample accuracy, bi-prediction and weighted prediction. HP422-MoCHA also includes a hardwired Motion Vector Predictor, supporting temporal and spatial direct predictions. A novel memory hierarchy implemented as a 3-D Cache reduces the frame memory access, providing, on average, 62% of bandwidth and 80% of clock cycles reduction. The design was implemented in a Xilinx Virtex-II PRO FPGA, and also in an ASIC with a TSMC 0.18 μm standard cells technology. The ASIC implementation occupies 102 K equivalent gates and 56.5 KB of on-chip SRAM in a 3.8 × 3.4 mm² area. It presents a power consumption of 130 mW. Both implementations reach a maximum operation frequency of ~100 MHz, being able to motion compensate 37 bi-predictive frames or 69 predictive fps. The minimum required frequency to ensure the real-time decoding for HD1080p at 30 fps is 82 MHz. Since HP422-MoCHA is the first Motion Compensation architecture for the High 4:2:2 profile found in the literature, a Main profile MoCHA was used for comparison purposes, showing the highest throughput among all presented works. However, the HP422-MoCHA architecture also reaches the highest throughput when compared with the other published Main profile MC solutions, even considering the significantly higher complexity of the High 4:2:2 profile. 相似文献